D50-D52 Nucleic Acids Research, 2014, Vol. 42, Database issue 
doi:10.1093lnarlgktl081 



Published online 21 November 2013 



Updates to BioSamples database at European 
Bioinformatics Institute 

Adam Faulconbridge, Tony Burdett, Marco Brandizi, Mikhail Gostev, Rui Pereira, 
Drashtti Vasant, Ugis Sarkans, Alvis Brazma and Helen Parkinson* 

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome 
Campus, Hinxton, Cambridge CB10 1SD, UK 



Received September 17, 2013; Revised and Accepted October 15, 2013 



ABSTRACT 

The BioSamples database at the EBI (http://www. 
ebi.ac.uk/biosamples) provides an integration point 
for BioSamples information between technology 
specific databases at the EBI, projects such as 
ENCODE and reference collections such as cell 
lines. The database delivers a unified query interface 
and API to query sample information across EBI's 
databases and provides links back to assay data- 
bases. Sample groups are used to manage 
related samples, e.g. those from an experimental 
submission, or a single reference collection. 
Infrastructural improvements include a new user 
interface with ontological and key word queries, a 
new query API, a new data submission API, 
complete RDF data download and a supporting 
SPARQL endpoint, accessioning at the point of sub- 
mission to the European Nucleotide Archive and 
European Genotype Phenotype Archives and 
improved query response times. 

INTRODUCTION 

The EBFs BioSamples database provides a single point of 
entry to sample data stored in EBI assay databases and 
delivers a dedicated query interface and API for accessing 
sample data. Samples are arranged into sample groups for 
ease of query, submission and to allow attributes to be 
added at the group level rather than at the sample 
specific level. This supports cases where values or attri- 
butes must be expressed as binned values across 
samples. This happens when the information is not avail- 
able or cannot be provided at the sample level for ethical 
reasons (1). When querying BioSamples users are offered 
hnks to assay databases, for example, to sequence infor- 
mation in the European Nucleotide Archive or ENA (2), 
gene expression microarrays in ArrayExpress (3) or prote- 
omics data PRoteomics IDEntitications database or 



PRIDE (4). The EBI's BioSamples database is developed 
in parallel with the NBCFs BioSamples database (5), 
which fulfils a similar function at the NCBI. This article 
describes data growth and new features implemented since 
our previous publication in 2011 (6). 

The EBI BioSamples database has doubled in size since 
January 2012 when 1 million samples were described in the 
BioSamples database, as of October 2013 2 846 137 
samples are available as 80232 groups. Data growth is 
attributed to new data sources, and increased volume of 
data from existing sources. New data sources include 
22 288 samples from The Cancer Genome Atlas (http:// 
cancergenome.nih.gov/), 920441 samples from the 
Catalogue of Somatic Mutation in Cancer — COSMIC 
(7); 920441 samples in 10 737 groups. Addition of 
samples from these sources provides interoperability 
between resources where, for example, COSMIC identi- 
fiers are included, e.g. in Ensembl (8). 

INFRASTRUCTURAL IMPROVEMENTS 

An updated web interface delivers new search functional- 
ity, improved tabular layout, ontology supported queries 
and access revised help documentation (see Figure 1). The 
experimental factor ontology (EFO) (9) is now for used 
query expansion using synonyms and child terms thus 
allowing more specific searches to be made. Users may 
select indexed key words or an EFO term for their 
queries in combination with Boolean Operators to refine 
their searches. EFO has been expanded in parallel to 
support BioSamples use cases, including an import of a 
'SLIM' of the Uber Anatomy Ontology Uberon (10) and a 
genetic disease classification from Orphanet (11). These 
were selected based on analysis of common user queries 
and provide enhanced queries for samples based on ana- 
tomical and disease characteristics. 

New programmatic interfaces are available for both 
retrieving and submitting data. The API supports 
queries by sample or sample group accession, and 
queries of samples, or sample groups by their attributes. 
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Figure 1. The new BioSamples web interface showing search results for a query of 'acute myeloid leukeinia'. Auto-complete on the search box 
suggests appropriate ontology and keyword terms as the query is entered by the user, including more specific terms from EFO such as subtypes of 
the disease. Highlighted results are colour coded for clarity; exact matches (yellow), synonyms (green) and more specific terms (red). 



For example, the URL http://www.ebi.ac.uk/biosamples/ 
xml/sample/SAME42581 returns all information about a 
sample in XML format. The APIs are documented on the 
BioSamples database help pages, and example queries are 
provided. 

Sample information can also be submitted to the 
BioSamples database through an API via a JSON repre- 
sentation of the SanipleTab tabular submission format. 
Part of the submission API is a SampleTab validation 
service, including a web page interface (http://www.ebi. 
ac.uk/biosamples/sampletab). This service is used for 
pre-submission validation of both manual and program- 
matic submissions. To maintain quahty for directly 
submitted sample information, use of the submission 
web service is restricted by API keys, which can be 
obtained from biosamples(«^ebi. ac.uk. 



DATA INTEGRATION 

The BioSamples database has improved interoperability 
with other EBI resources and with external groups. All 
submissions of sample information to ENA and 
European Genotype Phenotype Archive (http://www.ebi. 
ac.uk/ega) are now assigned a BioSamples Accession, 
which is returned to the submitter immediately as part 
of the submission process. Users can also pre-register 
sample(s) and re-use those accession(s) when submitting 
to resources such as ENA and EGA. By preregistering 
samples the BioSamples staff curates submitted data, 
and the BioSamples database becomes a single source of 
sample information across multiple experimental 
technologies and databases. This, in turn, encourages the 



use of BioSamples identifiers in other repositories to 
identify and hnk equivalent samples. 

Several major research projects have established hnks 
with the BioSamples database. For example, the HipSci 
project (http://www.hipsci.org/) pre-register information 
about donors and cell hues, including the relationships 
between them, with BioSamples database. The 
ENCODE (12) data coordination centre is working with 
BioSamples database to ensure their existing sample 
records are updated and annotated with ontology terms 
and in specifying relationships between samples in 
ENCODE datasets. To date sample information from 
users is submitted directly to the BioSamples database 
through both manual and automatic processes, both of 
which are supported by the curatorial staff. 

Other locations around the world have also estabhshed 
repositories of sample information, including the NCBI 
BioSample database. The BioSamples database at EBI is 
using a common accessioning scheme previously agreed 
with NCBI and DDBJ, and we expect that data 
exchange will be implemented in early 2014. 

As sample data can be identified by multiple accessions 
assigned by EBI and external databases an identifier 
and URL resolution service 'MyEquivalents' has been 
deployed. It provides mapping between different, but 
equivalent, sample identifiers. For example, human 
RNA-Seq data deposited at EBI may have identifiers for 
ArrayExpress, ENA and BioSamples database as these 
resources share records. In time the BioSamples database 
identifier will be the only sample identifier for new sub- 
missions, but until then, and to preserve backwards com- 
patibihty for legacy data, the MyEquivalents service 
provides redirection URLs and web services describing 
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mappings. More information and the source code for the 
MyEquivalents software is available (https://github.com/ 
EBIBioSampIes/myequivalents) . 

Finally, as a component of the EBI Resource 
Description Framework (RDF) platform (http://www. 
ebi.ac.uk/rdf) RDF is now available for the BioSamples 
database content. The schema is derived from the 
SampleTab format, supported by integration with 
existing ontologies such as the Ontology of Biomedical 
Investigations (13) and EFO. Data are made available as 
RDF and also for query via a SPARQL endpoint for 
which example queries are documented. 

FUTURE WORK 

The development of the process and tools supporting EBI- 
NCBl data exchange is underway in collaboration with 
NCBT EBI has completed a test parse and load of the 
current NCBI BioSamples database content and we are 
examining and mapping attributes used by the NBCI's 
and EBFs databases to deliver a core set of common at- 
tributes and context for these, for example, those required 
by standards such as The Minimal Information about a 
MetaGenome (14). The core attributes list will be used to 
facihtate data exchange, provide improved searches across 
attributes and drive context specific displays to ensure like 
attributes are displayed together for specific experiment 
types, e.g. latitude, longitude and depth for ocean 
samples. We will further improve our API and GUI 
access by implementing improved support for single 
sample level queries by technology and assay types. 
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